detection transformer
Detection Transformers Under the Knife: A Neuroscience-Inspired Approach to Ablations
Hütten, Nils, Hölken, Florian, Tercan, Hasan, Meisen, Tobias
In recent years, Explainable AI (XAI) has gained traction as an approach to enhancing model interpretability and transparency, particularly in complex models such as detection transformers. Despite rapid advancements, a substantial research gap remains in understanding the distinct roles of internal components - knowledge that is essential for improving transparency and efficiency. Inspired by neuroscientific ablation studies, which investigate the functions of brain regions through selective impairment, we systematically analyze the impact of ablating key components in three state-of-the-art detection transformer models: the detection transformer (DETR), the deformable detection transformer (DDETR), and DETR with improved denoising anchor boxes (DINO). The ablations target query embeddings, encoder and decoder multi-head self-attention (MHSA) layers, and decoder multi-head cross-attention (MHCA) layers. We evaluate the effects of these ablations using the gIoU and F1-score metrics, quantifying the impact on both the classification and regression sub-tasks on the COCO dataset. To facilitate reproducibility and future research, we publicly release the DeepDissect library. Our findings reveal model-specific resilience patterns: while DETR is particularly sensitive to ablations in encoder MHSA and decoder MHCA, DDETR's multi-scale deformable attention enhances robustness, and DINO exhibits the greatest resilience due to its look-forward-twice update rule, which helps distribute knowledge across blocks. These insights also expose structural redundancies, particularly in DDETR's and DINO's decoder MHCA layers, highlighting opportunities for model simplification without sacrificing performance. This study advances XAI for DETRs by clarifying the contributions of internal components to model performance, offering insights for optimizing models and improving transparency and efficiency in critical applications.
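A minimal sketch of this kind of component ablation, using the public facebookresearch/detr model from torch.hub rather than the DeepDissect library (whose API is not shown in the abstract); the module paths below follow that repository and may differ elsewhere:

```python
# Hedged sketch: "lesion" one component of a pretrained DETR by zeroing the
# output of its decoder multi-head cross-attention (MHCA) with a forward hook.
# Module paths follow the public facebookresearch/detr repo, not DeepDissect.
import torch

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

def ablate(module, inputs, output):
    # nn.MultiheadAttention returns (attn_output, attn_weights); zero the output only.
    if isinstance(output, tuple):
        return (torch.zeros_like(output[0]),) + output[1:]
    return torch.zeros_like(output)

# Ablate the MHCA of the first decoder block; swap in .self_attn or an
# encoder layer to probe the other components studied in the paper.
target = model.transformer.decoder.layers[0].multihead_attn
handle = target.register_forward_hook(ablate)

with torch.no_grad():
    out = model(torch.randn(1, 3, 256, 256))   # dummy image batch
print(out['pred_logits'].shape, out['pred_boxes'].shape)

handle.remove()  # restore the intact model before ablating the next component
```

Comparing detection quality (e.g., gIoU or F1) with and without the hook, one component at a time, reproduces the spirit of the neuroscience-inspired protocol described above.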
BOOTPLACE: Bootstrapped Object Placement with Detection Transformers
Zhou, Hang, Zuo, Xinxin, Ma, Rui, Cheng, Li
In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. Prior methods have leveraged generative models to reduce the reliance on dense supervision. However, this often limits their capacity to model complex data distributions. Alternatively, transformer networks with a sparse contrastive loss have been explored, but their over-relaxed regularization often leads to imprecise object placement. We introduce BOOTPLACE, a novel paradigm that formulates object placement as a placement-by-detection problem. Our approach begins by identifying suitable regions of interest for object placement. This is achieved by training a specialized detection transformer on object-subtracted backgrounds, enhanced with multi-object supervision. It then semantically associates each target compositing object with the detected regions based on their complementary characteristics. Through a bootstrapped training approach applied to randomly object-subtracted images, our model enforces meaningful placements through extensive paired data augmentation. Experimental results on established benchmarks demonstrate BOOTPLACE's superior performance in object repositioning, markedly surpassing state-of-the-art baselines on the Cityscapes and OPA datasets with notable improvements in IoU scores. Additional ablation studies further showcase the compositionality and generalizability of our approach, supported by user-study evaluations.
- Information Technology > Artificial Intelligence > Vision (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Natural Language (0.88)
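The placement-by-detection association described for BOOTPLACE can be illustrated with a toy matching step: candidate regions detected on an object-subtracted background are assigned to compositing objects by feature similarity. The feature dimensions and the Hungarian assignment below are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch of associating compositing objects with detected candidate regions
# by feature similarity (one-to-one assignment). Shapes and the use of
# linear_sum_assignment are assumptions for illustration.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def associate(object_feats: torch.Tensor, region_feats: torch.Tensor) -> dict:
    """object_feats: (M, D) embeddings of objects to place,
    region_feats: (N, D) embeddings of detected candidate regions (N >= M).
    Returns {object index: region index} via optimal assignment."""
    sim = F.normalize(object_feats, dim=-1) @ F.normalize(region_feats, dim=-1).T
    obj_idx, reg_idx = linear_sum_assignment((-sim).numpy())   # maximize similarity
    return dict(zip(obj_idx.tolist(), reg_idx.tolist()))

objects = torch.randn(3, 256)    # e.g. pooled features of 3 objects to paste
regions = torch.randn(10, 256)   # e.g. decoder embeddings of 10 detected regions
print(associate(objects, regions))
```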
Lung-DETR: Deformable Detection Transformer for Sparse Lung Nodule Anomaly Detection
Ramezani, Hooman, Aleman, Dionne, Létourneau, Daniel
Accurate lung nodule detection in computed tomography (CT) scan imagery is challenging in real-world settings due to the sparse occurrence of nodules and their similarity to other anatomical structures. In a typical positive case, nodules may appear in as few as 3% of CT slices, complicating detection. This paper presents a novel approach to lung tumor detection in CT data by framing the task as anomaly detection, targeting rare nodule occurrences in a predominantly normal dataset. Our method, named Lung-DETR, combines a Deformable Detection Transformer, Focal Loss, and Maximum Intensity Projection (MIP) into a unified framework for sparse lung nodule detection. A 7.5 mm MIP is utilized to combine adjacent lung slices, decreasing nodule sparsity and enhancing spatial context to allow for better differentiation between nodules, bronchioles, and other complex vascular structures. Lung-DETR is trained with a custom focal loss function to better handle the imbalanced dataset, and outputs bounding boxes around detected nodules. Our model achieves an F1 score of 94.2% (95.2% recall, 93.3% precision) on the LUNA16 dataset, with a test-set nodule sparsity of 4% that is reflective of real-world clinical data.
- Research Report > Promising Solution (0.54)
- Research Report > Experimental Study (0.48)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Health & Medicine > Nuclear Medicine (0.94)
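The 7.5 mm Maximum Intensity Projection preprocessing can be sketched as a sliding per-pixel maximum over neighbouring CT slices; the slice spacing and window handling below are assumptions for illustration, not the authors' pipeline.

```python
# Sketch of a sliding Maximum Intensity Projection (MIP) over adjacent CT
# slices, collapsing roughly a 7.5 mm window per output slice. The slice
# thickness is an assumed parameter, not taken from the paper's data.
import numpy as np

def sliding_mip(volume: np.ndarray, slice_thickness_mm: float = 1.5,
                window_mm: float = 7.5) -> np.ndarray:
    """volume: (num_slices, H, W) CT volume. Returns a volume of the same
    shape where each slice is the per-pixel max over its neighbouring window."""
    half = max(1, int(round(window_mm / slice_thickness_mm)) // 2)
    out = np.empty_like(volume)
    for i in range(volume.shape[0]):
        lo, hi = max(0, i - half), min(volume.shape[0], i + half + 1)
        out[i] = volume[lo:hi].max(axis=0)
    return out

ct = np.random.randn(120, 64, 64).astype(np.float32)   # dummy CT volume
print(sliding_mip(ct).shape)                            # (120, 64, 64)
```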
FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training
Bulat, Adrian, Guerrero, Ricardo, Martinez, Brais, Tzimiropoulos, Georgios
This paper is on Few-Shot Object Detection (FSOD), where given a few templates (examples) depicting a novel class (not seen during training), the goal is to detect all of its occurrences within a set of images. From a practical perspective, an FSOD system must fulfil the following desiderata: (a) it must be used as is, without requiring any fine-tuning at test time, (b) it must be able to process an arbitrary number of novel objects concurrently while supporting an arbitrary number of examples from each class and (c) it must achieve accuracy comparable to a closed system. Towards satisfying (a)-(c), in this work, we make the following contributions: We introduce, for the first time, a simple, yet powerful, few-shot detection transformer (FS-DETR) based on visual prompting that can address both desiderata (a) and (b). Our system builds upon the DETR framework, extending it based on two key ideas: (1) feed the provided visual templates of the novel classes as visual prompts during test time, and (2) "stamp" these prompts with pseudo-class embeddings (akin to soft prompting), which are then predicted at the output of the decoder. Importantly, we show that our system is not only more flexible than existing methods, but also takes a step towards satisfying desideratum (c). Specifically, it is significantly more accurate than all methods that do not require fine-tuning and even matches and outperforms the current state-of-the-art fine-tuning based methods on the most well-established benchmarks (PASCAL VOC & MSCOCO).
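One plausible reading of the prompting mechanism is sketched below: template features are "stamped" with learnable pseudo-class embeddings and concatenated with the usual object queries before the decoder. Dimensions and the exact wiring are assumptions based on the abstract, not the FS-DETR implementation.

```python
# Toy sketch of "stamping" visual templates with pseudo-class embeddings and
# feeding them to the decoder alongside object queries. Dimensions and wiring
# are illustrative assumptions, not the FS-DETR code.
import torch
import torch.nn as nn

num_pseudo_classes, d_model, num_queries = 10, 256, 100
pseudo_class_emb = nn.Embedding(num_pseudo_classes, d_model)
object_queries = nn.Embedding(num_queries, d_model)

def build_decoder_input(template_feats: torch.Tensor, pseudo_ids: torch.Tensor):
    """template_feats: (K, d_model) pooled features of K novel-class templates,
    pseudo_ids: (K,) arbitrary pseudo-class indices assigned at test time."""
    prompts = template_feats + pseudo_class_emb(pseudo_ids)    # "stamp" the templates
    queries = object_queries.weight                            # (num_queries, d_model)
    return torch.cat([prompts, queries], dim=0).unsqueeze(1)   # (K + num_queries, 1, d_model)

tgt = build_decoder_input(torch.randn(2, d_model), torch.tensor([0, 1]))
print(tgt.shape)   # torch.Size([102, 1, 256])
```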
Focused Decoding Enables 3D Anatomical Detection by Transformers
Wittmann, Bastian, Navarro, Fernando, Shit, Suprosanna, Menze, Bjoern
Detection Transformers represent end-to-end object detection approaches based on a Transformer encoder-decoder architecture, exploiting the attention mechanism for global relation modeling. Although Detection Transformers deliver results on par with or even superior to their highly optimized CNN-based counterparts operating on 2D natural images, their success is closely coupled to access to a vast amount of training data. This, however, restricts the feasibility of employing Detection Transformers in the medical domain, as access to annotated data is typically limited. To tackle this issue and facilitate the advent of medical Detection Transformers, we propose a novel Detection Transformer for 3D anatomical structure detection, dubbed Focused Decoder. Focused Decoder leverages information from an anatomical region atlas to simultaneously deploy query anchors and restrict the cross-attention's field of view to regions of interest, which allows for a precise focus on relevant anatomical structures. We evaluate our proposed approach on two publicly available CT datasets and demonstrate that Focused Decoder not only provides strong detection results and thus alleviates the need for a vast amount of annotated data but also exhibits exceptional and highly intuitive explainability of results via attention weights. Our code is available at https://github.com/bwittmann/transoar.
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (0.93)
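The core restriction of the cross-attention's field of view can be expressed as an attention mask derived from an atlas region of interest; the grid layout and mask convention below (True means the position may not be attended) are assumptions for illustration, not the released transoar code.

```python
# Sketch of an atlas-derived region-of-interest mask for cross-attention,
# in the boolean convention of torch.nn.MultiheadAttention (True = blocked).
# Grid layout and box format are illustrative assumptions.
import torch

def roi_attention_mask(grid_size, roi_box, num_queries):
    """grid_size: (D, H, W) of the 3D encoder feature map,
    roi_box: (z0, z1, y0, y1, x0, x1) voxel bounds of the region of interest.
    Returns a (num_queries, D*H*W) boolean cross-attention mask."""
    D, H, W = grid_size
    z0, z1, y0, y1, x0, x1 = roi_box
    inside = torch.zeros(D, H, W, dtype=torch.bool)
    inside[z0:z1, y0:y1, x0:x1] = True
    allowed = inside.flatten()                                   # (D*H*W,)
    return ~allowed.unsqueeze(0).expand(num_queries, -1)

mask = roi_attention_mask((8, 16, 16), (2, 6, 4, 12, 4, 12), num_queries=27)
print(mask.shape, int((~mask[0]).sum()))   # queries may only attend inside the ROI
```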
Li3DeTr: A LiDAR based 3D Detection Transformer
Erabati, Gopi Krishna, Araujo, Helder
Inspired by recent advances in vision transformers for object detection, we propose Li3DeTr, an end-to-end LiDAR-based 3D Detection Transformer for autonomous driving that inputs LiDAR point clouds and regresses 3D bounding boxes. The LiDAR local and global features are encoded using sparse convolution and multi-scale deformable attention, respectively. In the decoder head, firstly, in the novel Li3DeTr cross-attention block, we link the LiDAR global features to 3D predictions, leveraging a sparse set of object queries learnt from the data. Secondly, the object query interactions are formulated using multi-head self-attention. Finally, the decoder layer is repeated $L_{dec}$ times to refine the object queries. Inspired by DETR, we employ a set-to-set loss to train the Li3DeTr network. Without bells and whistles, Li3DeTr achieves 61.3% mAP and 67.6% NDS on the nuScenes dataset, surpassing state-of-the-art methods that use non-maximum suppression (NMS), and it also achieves competitive performance on the KITTI dataset. We also employ knowledge distillation (KD) with a teacher and student model, which slightly improves the performance of our network.
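A schematic of the decoder loop described above, with cross-attention from the object queries to the LiDAR global features followed by self-attention among the queries, repeated $L_{dec}$ times; norms, feed-forward blocks, and the deformable-attention details are omitted, and the dimensions are assumptions.

```python
# Schematic decoder loop: per layer, queries cross-attend to the encoded LiDAR
# features, then self-attend to each other. Norms/FFNs and deformable-attention
# internals are omitted; dimensions are illustrative assumptions.
import torch
import torch.nn as nn

d_model, n_heads, L_dec, num_queries = 256, 8, 6, 300
cross_attn = nn.ModuleList([nn.MultiheadAttention(d_model, n_heads) for _ in range(L_dec)])
self_attn = nn.ModuleList([nn.MultiheadAttention(d_model, n_heads) for _ in range(L_dec)])

queries = torch.randn(num_queries, 1, d_model)   # learnt object queries (batch size 1)
memory = torch.randn(4096, 1, d_model)           # encoded LiDAR global features

for layer in range(L_dec):
    queries = queries + cross_attn[layer](queries, memory, memory)[0]    # query <- scene
    queries = queries + self_attn[layer](queries, queries, queries)[0]   # query <-> query
print(queries.shape)   # refined queries fed to the classification/box heads
```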
- Education (0.54)
- Information Technology (0.35)
- Transportation > Ground > Road (0.35)
A Hybrid System of Sound Event Detection Transformer and Frame-wise Model for DCASE 2022 Task 4
Li, Yiming, Guo, Zhifang, Ye, Zhirong, Wang, Xiangdong, Liu, Hong, Qian, Yueliang, Tao, Rui, Yan, Long, Ouchi, Kazushige
In this paper, we describe in detail our system for DCASE 2022 Task 4. The system combines two considerably different models: an end-to-end Sound Event Detection Transformer (SEDT) and a frame-wise model, Metric Learning and Focal Loss CNN (MLFL-CNN). The former is an event-wise model which learns event-level representations and predicts sound event categories and boundaries directly, while the latter is based on the widely adopted frame-classification scheme, under which each frame is classified into event categories and event boundaries are obtained by post-processing such as thresholding and smoothing. For SEDT, self-supervised pre-training on unlabeled data is applied, and semi-supervised learning is adopted by using an online teacher, which is updated from the student model using the Exponential Moving Average (EMA) strategy and generates reliable pseudo labels for weakly-labeled and unlabeled data. For the frame-wise model, the ICT-TOSHIBA system of DCASE 2021 Task 4 is used. Experimental results show that the hybrid system considerably outperforms either individual model, achieving a PSDS1 of 0.420 and a PSDS2 of 0.783 on the validation set without external data. The code is available at https://github.com/965694547/Hybrid-system-of-frame-wise-model-and-SEDT.
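The online-teacher update used for the semi-supervised SEDT training follows the standard exponential moving average pattern; the toy network and decay value below are assumptions for illustration.

```python
# Minimal sketch of an EMA (mean-teacher) update: the teacher's weights track
# a slow moving average of the student and produce pseudo labels for
# weakly-labeled/unlabeled clips. Toy network and decay value are assumptions.
import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, decay: float = 0.999):
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

# Called after every optimizer step on the student:
ema_update(teacher, student)
pseudo_labels = teacher(torch.randn(8, 64)).sigmoid()   # pseudo labels for 8 unlabeled clips
print(pseudo_labels.shape)
```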
Towards Data-Efficient Detection Transformers
Wang, Wen, Zhang, Jing, Cao, Yang, Shen, Yongliang, Tao, Dacheng
Detection Transformers have achieved competitive performance on the sample-rich COCO dataset. However, we show that most of them suffer significant performance drops on small-size datasets such as Cityscapes. In other words, detection transformers are generally data-hungry. To tackle this problem, we empirically analyze the factors that affect data efficiency through a step-by-step transition from a data-efficient RCNN variant to the representative DETR. The empirical results suggest that sparse feature sampling from local image areas holds the key. Based on this observation, we alleviate the data-hungry issue of existing detection transformers by simply alternating how key and value sequences are constructed in the cross-attention layer, with minimum modifications to the original models. In addition, we introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency. Experiments show that our method can be readily applied to different detection transformers and improves their performance on both small-size and sample-rich datasets. Code will be made publicly available at https://github.com/encounter1997/DE-DETRs.
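One plausible reading of the sparse local feature sampling in the cross-attention layer is sketched below with RoIAlign: each query's key/value sequence is gathered from a small window around its current reference box rather than from the full feature map. This is an illustrative interpretation of the abstract, not the DE-DETRs code.

```python
# Hedged sketch: build per-query key/value sequences from local image areas
# around reference boxes via RoIAlign, instead of the full feature map.
# An illustrative reading of the abstract, not the released DE-DETRs code.
import torch
from torchvision.ops import roi_align

def local_key_value(features: torch.Tensor, ref_boxes: torch.Tensor, out_size: int = 4):
    """features: (1, C, H, W) encoder feature map for one image,
    ref_boxes: (Q, 4) reference boxes as (x1, y1, x2, y2) in feature-map coords.
    Returns per-query key/value sequences of shape (Q, out_size*out_size, C)."""
    batch_idx = torch.zeros(ref_boxes.shape[0], 1)            # single image in the batch
    rois = torch.cat([batch_idx, ref_boxes], dim=1)           # (Q, 5) with batch index
    crops = roi_align(features, rois, output_size=out_size)   # (Q, C, out_size, out_size)
    return crops.flatten(2).transpose(1, 2)                   # (Q, out_size*out_size, C)

feat = torch.randn(1, 256, 32, 32)
boxes = torch.tensor([[4.0, 4.0, 12.0, 12.0], [10.0, 8.0, 20.0, 24.0]])
print(local_key_value(feat, boxes).shape)   # torch.Size([2, 16, 256])
```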